* Add URL query parameter to create/edit page to clear cache
* Add refresh bookmark metadata button in create/edit bookmark page
* Fix refresh bookmark metadata when editing existing bookmark
* Add bulk refresh metadata functionality
* Fix test cases for bulk view dropdown selection list
* Allow bulk metadata refresh when background tasks are disabled
* Move load preview image call on refresh metadata
* Update bookmark modified time on metadata refresh
* Rename function to align with convention
* Add tests for refresh task
* Add tests for bookmarks service refresh metadata
* Add tests for bookmarks API disable cache on check
* Remove bulk refresh metadata when background tasks disabled
* Refactor refresh metadata task
* Remove unnecessary call
* Fix testing mock name
* Abstract clearing metadata cache
* Add test to check if load page is called twice when cache disabled
* Remove refresh button for new bookmarks
* Remove strict disable cache is true check
* Refactor refresh metadata form logic into its own function
* Move button and highlight changes
* Polish and update tests
---------
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@gmail.com>
* Add migration for merging fields
* Remove usage of website title and description
* Keep empty website title and description in API for compatibility
* Restore scraping in API and add option for disabling it
* Document API scraping behavior
* Remove deprecated fields from API docs
* Improve form layout
* Clean up migration
* Clean up website loader
* Update tests
* Support pytest for running tests
* Support extracting description from meta og:description property (see the sketch after this list)
* Revert changes to TOC
* Add test
---------
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@gmail.com>
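For illustration, a minimal sketch of such an og:description fallback, assuming BeautifulSoup is used for parsing; the helper name and the lookup order are assumptions rather than the project's actual implementation:

```python
from bs4 import BeautifulSoup


def extract_description(html: str):
    """Return a page description, falling back to the og:description meta tag."""
    soup = BeautifulSoup(html, "html.parser")
    # Try the standard meta description first; this ordering is an assumption.
    tag = soup.find("meta", attrs={"name": "description"})
    if tag is None or not tag.get("content"):
        # Fall back to the Open Graph description property.
        tag = soup.find("meta", attrs={"property": "og:description"})
    if tag and tag.get("content"):
        return tag["content"].strip()
    return None
```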
Limits the size of scraped HTML documents to prevent out-of-memory errors. The scraper stops reading from the response once it encounters the closing head tag, or once the content read so far exceeds a maximum size.
Fixes #345
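For illustration, a minimal sketch of that idea using requests with streaming reads; the function name, size limit, and chunk size here are placeholders, not the project's actual values:

```python
import requests

MAX_CONTENT_LENGTH = 5 * 1024 * 1024  # illustrative cap on downloaded bytes
CHUNK_SIZE = 50 * 1024


def load_page_limited(url: str) -> bytes:
    """Read a page in chunks, stopping at </head> or at the size limit."""
    content = b""
    with requests.get(url, timeout=10, stream=True) as response:
        for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
            content += chunk
            # The metadata we need lives in the head, so stop once it is complete.
            if b"</head>" in content:
                break
            # Bail out before the document grows beyond the limit.
            if len(content) > MAX_CONTENT_LENGTH:
                break
    return content
```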
* Avoid stalls during web scraping
This patch fixes a stall during web scraping. I ran into a case where
scraping never finished when adding a bookmark for a certain site.
Passing a timeout parameter to the requests.get() call avoids this.
Signed-off-by: Taku Izumi <admin@orz-style.com>
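For illustration, a minimal sketch of the fix, assuming the scraper fetches pages with requests; the timeout value shown is an arbitrary placeholder:

```python
import requests

REQUEST_TIMEOUT = 10  # seconds; illustrative value


def load_page(url: str) -> str:
    # Without a timeout, requests.get() can block indefinitely if the server
    # accepts the connection but never finishes sending a response.
    response = requests.get(url, timeout=REQUEST_TIMEOUT)
    return response.text
```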
* Avoid character corruption when scraping some Japanese sites
This patch fixes character corruption when scraping some Japanese sites.
To avoid the corruption, the load_page function now uses r.content instead
of r.text. The cause, I think, is an encoding problem: r.text decodes the
response as text using the charset that requests assumes, so if the site's
actual charset is different, the characters get corrupted. r.content returns
the raw bytes, which avoids the decoding problem.
Signed-off-by: Taku Izumi <admin@orz-style.com>
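A hedged sketch of the difference; handing the bytes to BeautifulSoup at the end is illustrative and not necessarily how the project consumes them:

```python
import requests
from bs4 import BeautifulSoup


def load_page(url: str) -> bytes:
    r = requests.get(url, timeout=10)
    # r.text decodes using the charset requests guesses (falling back to
    # ISO-8859-1 when the header declares none), which garbles pages served
    # as Shift_JIS or EUC-JP. r.content returns the raw bytes instead and
    # leaves the decoding to whoever parses the document.
    return r.content


# BeautifulSoup can sniff the document encoding itself when handed raw bytes.
soup = BeautifulSoup(load_page("https://example.com"), "html.parser")
```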
* Use charset_normalizer to determine response encoding
Co-authored-by: Taku Izumi <admin@orz-style.com>
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>
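A minimal sketch of that approach, assuming the charset_normalizer package; the fallback to r.text when detection fails is an assumption for illustration:

```python
import requests
from charset_normalizer import from_bytes


def load_page(url: str) -> str:
    r = requests.get(url, timeout=10)
    # Inspect the raw bytes and pick the most likely encoding, rather than
    # trusting a charset header that may be missing or wrong.
    best_match = from_bytes(r.content).best()
    if best_match is None:
        # Detection failed; fall back to requests' own decoding.
        return r.text
    return str(best_match)
```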