Commit Graph

6 Commits

Author SHA1 Message Date
Luca
c2d8cde86b Trim website metadata title and description (#383)
* feat: trim fetched metadata placeholders

* feat: implement trimming serverside

* Add website loader tests

* Address review comments

Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@gmail.com>
2023-01-12 21:06:36 +01:00
Sascha Ißbrücker
2fd7704816 Limit document size for website scraper (#354)
Limits the size of scraped HTML documents to prevent out of memory errors. The scraper will stop reading from the response when it encounters the closing head tag, or if the read content's size exceeds a max limit.

Fixes #345
2022-10-07 21:18:18 +02:00
Dustin Blackman
b53bd9f112 Bump waybackpy to 3.0.6 (#281)
* fix wayback

* fix tests

* Reuse user agent from website loader

Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@gmail.com>
2022-07-03 06:26:16 +02:00
Sascha Ißbrücker
e08bf9fd03 Fake request headers to reduce bot detection (#263)
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@gmail.com>
2022-05-21 13:25:32 +02:00
Taku Izumi
937858cf58 Fix website scraper decoding content incorrectly (#126)
* Avoid stall on web scraping

This patch fixes stall on web scraping.
I encountered a stall (scraping never ends) when adding
a bookmark of some site.
To avoid this case, adding a timeout parameter at requests.get()
function is a solution.

Signed-off-by: Taku Izumi <admin@orz-style.com>

* Avoid character corruption of scraping some Japanese sites

This patch fixes character corruption of scraping some Japanese
sites. To avoid character corruption, I use r.content instead
of r.text in load_page function.

The reason of character corruption is encoding problem, I think.
r.text handles data as unicode encoded text, so if scraping
web site's charset is not unicode encoded, character corruption
occurs. r.content handles data as str[], we can avoid encoding
problem.

Signed-off-by: Taku Izumi <admin@orz-style.com>

* use charset_normalizer to determine response encoding

Co-authored-by: Taku Izumi <admin@orz-style.com>
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>
2021-08-25 10:16:23 +02:00
Sascha Ißbrücker
e07da529f1 Preview website title + description in bookmark form
Fix unnecessary selects when rendering bookmarks
2019-07-02 01:28:02 +02:00