Python library to mirror webpages and websites.
Created By : Raja Tomar
License : MIT
Email: [email protected]
Clone websites and webpages with Python at ease. Scrape or save complete webpages and websites with Python.
Web scraping and archiving tool written in Python. Archive any online website and its assets, css, js and images for offline reading, storage or whatever reason. It's easy with `pywebcopy`.
Why it's great? Because it -
- respects `robots.txt`
Email me at [email protected] for any query :)
`pywebcopy` is available on PyPI and is easily installable using `pip`:

```shell
$ pip install pywebcopy
```
You are ready to go. Read the tutorials below to get started.
You should always check if the latest pywebcopy is installed successfully.
```python
>>> import pywebcopy
>>> pywebcopy.__version__
'6.0.0'
```
Your version may be different; now you can continue with the tutorial.
To save any single page, just type the following in the python console:
```python
from pywebcopy import save_webpage

kwargs = {'project_name': 'some-fancy-name'}

save_webpage(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads',
    **kwargs
)
```
To save a full website (this could overload the target server, so be careful):
```python
from pywebcopy import save_website

kwargs = {'project_name': 'some-fancy-name'}

save_website(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads',
    **kwargs
)
```
Running tests is simple and doesn't require any external library. Just run this command from the root directory of the pywebcopy package.
```shell
$ python -m pywebcopy run-tests
```
`pywebcopy` has a very easy-to-use command-line interface which can help you do tasks without having to worry about the inner workings.

#### Getting the list of commands
```shell
$ python -m pywebcopy -- --help
```

#### Using the apis
```shell
$ python -m pywebcopy save_webpage http://google.com E://store// --bypass_robots=True
# or
$ python -m pywebcopy save_website http://google.com E://store// --bypass_robots
```

#### Running tests
```shell
$ python -m pywebcopy run_tests
```
Most of the time authentication is needed to access a certain page. It's really easy to authenticate with `pywebcopy` because it uses a `requests.Session` object for base http activity, which can be accessed through the `pywebcopy.SESSION` attribute. And as you know, there are tons of tutorials on setting up authentication with `requests.Session`.
Here is a basic example of simple http auth -
```python
import pywebcopy

pywebcopy.SESSION.headers.update({
    'auth': {'username': 'password'},
    'form': {'key1': 'value1'},
})

kwargs = {
    'url': 'http://localhost:5000',
    'project_folder': 'e://savedpages//',
    'project_name': 'mysite'
}

pywebcopy.config.setup_config(**kwargs)
pywebcopy.save_webpage(**kwargs)
```
### 2.1 `WebPage` class

The `WebPage` class is the engine of these saving actions. You can use this class to access many more methods with which to customise the process.
#### Creating the instance

You can directly import this class from the `pywebcopy` package.
```python
from pywebcopy import WebPage

wp = WebPage()
```
#### Fetching the html source from the internet

You can tell it to fetch the source from the internet; it then uses the `requests` module to fetch it for you. You can pass in the several params which `requests.get()` would accept, e.g. proxies, auth etc.
```python
from pywebcopy import WebPage

wp = WebPage()

# You can choose to load the page explicitly using the
# `requests` module, with params `requests` would take
url = 'http://google.com'
params = {
    'auth': '[email protected]',
    'proxies': 'localhost:5000',
}
wp.get(url, **params)
```
#### Providing your own opened file

You can also provide opened source handles directly:
```python
from pywebcopy import WebPage

wp = WebPage()

# You can choose to set the source yourself
handle = open('file.html', 'rb')
wp.set_source(handle)
```
#### `WebPage` properties and methods

Apis which the `WebPage` object exposes after creation through any of the methods described above.
#### `.file_path` property

Read-only location at which this file will end up when you try to save the parsed html source. To change this location you have to manipulate the `.utx` property of the `WebPage` class. You can look it up below.
#### `.project_path` property

Read-only location at which all the files will end up when you try to save the complete webpage. To change this location you have to manipulate the `.utx` property of the `WebPage` class. You can look it up below.
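For instance, here is a quick way to inspect both locations after configuring a project (a minimal sketch; the url and folder values are illustrative, and the exact paths depend on your `setup_config` arguments):

```python
from pywebcopy import WebPage, config

# hypothetical project values, for illustration only
config.setup_config('http://example-site.com/index.html',
                    'path/to/downloads', 'my_project')

wp = WebPage()
wp.get('http://example-site.com/index.html')

print(wp.file_path)     # where the parsed html would be written
print(wp.project_path)  # root folder for the complete webpage files
```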
#### `.save_assets` method

This method saves all the css, js, images, fonts etc. in the folder you set up through the `.project_path` property.
```python
from pywebcopy import WebPage

wp = WebPage()
wp.get('http://google.com')
wp.save_assets()
#> css, js, images etc. would be saved in the folder
#> which `.project_path` property returns
```
#### `.save_html` method

After setting up the `WebPage` instance you can use this method to save a local copy of the parsed and modified html at the `.file_path` property value.
```python
from pywebcopy import WebPage

wp = WebPage()
wp.get('http://google.com')
wp.save_html()
#> a .html file would be saved at location which
#> `.file_path` property returns
```
#### `.save_complete` method

This is the important api which you would be using frequently for saving or cloning a webpage for later reading or whatever the use case may be. This method saves all the css, js, images, fonts etc. in the same order as most browsers would when you click on the save page option in the right-click menu.

If you want the complete webpage with css, js and images:
```python
from pywebcopy import WebPage

wp = WebPage()
wp.get('http://google.com')
wp.save_complete()
```
Multiple scraping packages are wrapped up in one object which you can use to unlock the best of all those libraries in one go, without having to go through the hassle of instantiating each of those libraries.

> To use all the methods and properties documented below,
> just create an object once as described.
```python
from pywebcopy import MultiParser
import requests

req = requests.get('http://google.com')

html = req.content
encoding = req.encoding

wp = MultiParser(html, encoding)

# All code below follows the code above
```
#### BeautifulSoup methods are supported

You can also use any BeautifulSoup methods on it:

```python
>>> links = wp.bs4.find_all('a')
['//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/download/other/']
```
#### `lxml` is completely supported

You can use any lxml methods on it. Read more about lxml at http://lxml.de/

```python
>>> wp.lxml.xpath('//a', ..)
[<Element a>, <Element a>]
```
#### `pyquery` is fully supported

You can use any PyQuery methods on it. Read more about pyquery at https://pythonhosted.org/pyquery/

```python
>>> wp.pq.select(selector, ..)
...
```
#### `lxml.xpath` is also supported

xpath is also natively supported and returns a :class:`requests_html.Element`. See more at https://html.python-requests.org

```python
>>> wp.xpath('a')
[<Element 'a' ...>]
```
#### Select only elements containing certain text

Provided through the `requests_html` module.

```python
>>> wp.find('a', containing='kenneth')
[<Element 'a' ...>, ...]
```
#### `Crawler` object

This is a subclass of the `WebPage` class and can be used to mirror any website.

```python
>>> from pywebcopy import Crawler, config

>>> url = 'http://some-url.com/some-page.html'
>>> project_folder = '/home/desktop/'
>>> project_name = 'my_project'
>>> kwargs = {'bypass_robots': True}

# You should always start with setting up the config or use apis
>>> config.setup_config(url, project_folder, project_name, **kwargs)

# Create an instance of the crawler object
>>> wp = Crawler()

# If you want to, you can use `requests` to fetch the pages
>>> wp.get(url, **{'auth': ('username', 'password')})

# Then you can access several methods like
>>> wp.crawl()
```
It is easy to make a beginner's mistake or get confused, so here are the common errors and how to correct them if you are facing them.
#### `pywebcopy.exceptions.AccessError`

If you are getting a `pywebcopy.exceptions.AccessError` exception, then check if the website allows scraping of its content.

```python
>>> import pywebcopy
>>> pywebcopy.config['bypass_robots'] = True

# rest of your code follows..
```
#### Overwrite existing files when copying

If you want to overwrite existing files in the directory then use the `over_write` config key.

```python
import pywebcopy

pywebcopy.config['over_write'] = True

# rest of your code follows..
```
#### Changing your project name

By default pywebcopy creates a directory inside `project_folder` named after the url you have provided, but you can change this using the code below:

```python
>>> import pywebcopy
>>> pywebcopy.config['project_name'] = 'my_project'

# rest of your code follows..
```
A particular webpage can be saved easily using the following methods.

Note: if you get `pywebcopy.exceptions.AccessError` when running any of this code, then use the fix described in the `AccessError` section.
#### `save_webpage()`

A webpage can easily be saved using the inbuilt function called `.save_webpage()`, which also takes several arguments.

```python
>>> from pywebcopy import save_webpage
>>> save_webpage(project_url='http://google.com', project_folder='c://Saved_Webpages/',)
```
This use case is slightly more powerful, as it can provide every functionality of the `WebPage` class.

```python
>>> from pywebcopy import WebPage, config

>>> url = 'http://some-url.com/some-page.html'

# You should always start with setting up the config or use apis
>>> config.setup_config(url, project_folder, project_name, **kwargs)

# Create an instance of the webpage object
>>> wp = WebPage()

# If you want to use `requests` to fetch the page then
>>> wp.get(url)

# Else if you want to use plain html or urllib then use
>>> wp.set_source(object_which_have_a_read_method, encoding=encoding)
>>> wp.url = url  # you need to do this if you are using set_source()

# Then you can access several methods like
>>> wp.save_complete()
>>> wp.save_html()
>>> wp.save_assets()
```
This `WebPage` object contains every method of the `WebPage()` class and thus can be reused for later usages.
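For example, a minimal sketch of such reuse, building on the `wp` instance created above:

```python
# The same instance can serve the individual save calls later on,
# without repeating the config setup or the fetch.
>>> wp.save_html()    # just the parsed html
>>> wp.save_assets()  # the css/js/images for the same page
```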
I told you earlier that the `WebPage` object is powerful and can be manipulated in many ways. One feature is that raw html is now also accepted.
```python
>>> from pywebcopy import WebPage, config

>>> HTML = open('test.html').read()

>>> base_url = 'http://example.com'  # used as a base for downloading imgs, css, js files.
>>> project_folder = '/saved_pages/'
>>> config.setup_config(base_url, project_folder)

>>> wp = WebPage()
>>> wp.set_source(HTML)
>>> wp.url = base_url
>>> wp.save_complete()
```
Use caution when copying websites, as this can overload or damage the servers of the site and in rare cases could even be illegal, so check everything before you proceed.
#### `save_website()`

Using the inbuilt api `.save_website()`, which takes several arguments:

```python
>>> from pywebcopy import save_website

>>> save_website(project_url='http://localhost:8000', project_folder='e://tests/')
```
By creating a `Crawler()` object, which provides several other functions as well:

```python
>>> from pywebcopy import Crawler, config

>>> config.setup_config(project_url='http://localhost:5000/', project_folder='e://tests/', project_name='LocalHost')

>>> crawler = Crawler()
>>> crawler.crawl()
```
`pywebcopy` is highly configurable. You can set up the global object using the methods exposed by the `pywebcopy.config` object.

Ways to change the global configurations are below -
#### Using the method `.setup_config` on the global `pywebcopy.config` object

You can manually configure every configuration by using a `.setup_config` call.

```python
>>> import pywebcopy

>>> url = 'http://example-site.com/index.html'
>>> download_loc = 'path/to/downloads/'
>>> project = 'my_project'

>>> pywebcopy.config.setup_config(url, download_loc, project, **kwargs)

# done!

# Now check
>>> pywebcopy.config.get('project_url')
'http://example-site.com/index.html'

>>> pywebcopy.config.get('project_folder')
'path/to/downloads'

>>> pywebcopy.config.get('project_name')
'example-site.com'
```
You can also change any config even after the `setup_config` call:

```python
pywebcopy.config['url'] = 'http://url-changed.com'

# rest of the config remains unchanged
```

Done!
#### Passing in the config vars directly to the global apis, e.g. `.save_webpage`

To change any configuration, just pass it to the api call.

Example:

```python
from pywebcopy import save_webpage

kwargs = {
    'project_url': 'http://google.com',
    'project_folder': '/home/pages/',
    'project_name': ...
}

save_webpage(**kwargs)
```
#### Configurations

Below is the list of `config` keys with their `default` values:

```python
# writes the trace output and log file content to console directly
'DEBUG': False

# make zip archive of the downloaded content
'zip_project_folder': True

# delete the project folder after making zip archive of it
'delete_project_folder': False

# to download css file or not
'LOAD_CSS': True

# to download images or not
'LOAD_IMAGES': True

# to download js file or not
'LOAD_JAVASCRIPT': True

# to overwrite the existing files if found
'OVER_WRITE': False

# list of allowed file extensions
# (shortened for readability)
'ALLOWED_FILE_EXT': ['.html', '.css', ...]

# log file path
'LOG_FILE': None

# name of the mirror project
'PROJECT_NAME': website-name.com

# define the base directory to store all copied sites data
'PROJECT_FOLDER': None

# DANGER ZONE
# CHANGE THESE AT YOUR OWN RESPONSIBILITY
# NOTE: Do not change unless you know what you're doing
# requests headers to be shown on requests made to server
'http_headers': {...}

# bypass the robots.txt restrictions
'BYPASS_ROBOTS': False
```
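For example, to tweak a few of these defaults before a save call, you can use the same dictionary-style access shown in the common-errors section above (a minimal sketch; the lowercase key style follows the earlier examples in this document):

```python
import pywebcopy

pywebcopy.config['zip_project_folder'] = False  # keep the plain folder, skip the zip archive
pywebcopy.config['load_javascript'] = False     # don't download .js files
pywebcopy.config['bypass_robots'] = True        # ignore robots.txt restrictions (use responsibly)
```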
You can contribute in many ways.

If you have any suggestions, fixes or reports, feel free to mail me :)

I built many utils and classes in this project to ease the tasks I was trying to do. But these tasks are also suitable for general-purpose use. So, if you want, you can help in generating suitable documentation for these undocumented ones; you can always create a pull request or email me.
- `Python Fire` library.
- `config.setup_paths`.
- `pywebcopy.__all__` attr generation.
- `WebPage` class now doesn't take any arguments (breaking change).
- `WebPage` class has new methods `WebPage.get` and `WebPage.set_source`.
- `core.setup_config` function is changed to `config.setup_config`.
- `utils.trace` decorator, which will print function_name, args, kwargs and return value when the debug config key is True.
- `config.config` key called `parser`.
- `user-agent` key cracked webpages. You can now use any browser's user-agent id and it will get the exact same page downloaded.
- `generators.extract_css_urls`, which was caused by the `str` and `bytes` difference in python3.
- error handling added to required functions.
- `init` function is replaced with `save_webpage`.
- `config` automation functions are added:
  - `core.setup_config` (creates every ideal config just from the url and download location)
  - `config.reset_config` (resets the configuration to the default state)
  - `config.update_config` (manual-mode version of `core.setup_config`)
- `structures.WebPage` added.
- `generators.generate_style_map` and `generators.generate_relative_paths` merged into a single function `generators.generate_style_map`.
- `exceptions` added.
- `url` is checked and resolved of any redirection before starting any work functions.
- `init` vars: `mirrors_dir` and `clean_up` were fixed, which cleaned the dir before the log was completely written.
- `init` call now takes the `url` arg by default and could raise an error when not supplied.
- `zipfile` and `exceptions` handling to prevent errors and eventual archive corruption.
- `structures.WebPage`.