This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Commit ba7a91a

Refactor oEmbed previews (#10814)
The major change is moving the decision of whether to use oEmbed further up the call-stack. This reverts the _download_url method to being a "dumb" function which takes a single URL and downloads it (as it was before #7920).

This also makes more minor refactorings:

* Renames internal variables for clarity.
* Factors out shared code between the HTML and rich oEmbed previews.
* Fixes tests to preview an oEmbed image.
1 parent 2843058 commit ba7a91a

5 files changed

+299
-220
lines changed


changelog.d/10814.feature

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1 @@
1+
Improve oEmbed previews by processing the author name, photo, and video information.

docs/development/url_previews.md

Lines changed: 13 additions & 8 deletions
@@ -25,23 +25,28 @@ When Synapse is asked to preview a URL it does the following:
2525
3. Kicks off a background process to generate a preview:
2626
1. Checks the database cache by URL and timestamp and returns the result if it
2727
has not expired and was successful (a 2xx return code).
28-
2. Checks if the URL matches an oEmbed pattern. If it does, fetch the oEmbed
29-
response. If this is an image, replace the URL to fetch and continue. If
30-
if it is HTML content, use the HTML as the document and continue.
31-
3. If it doesn't match an oEmbed pattern, downloads the URL and stores it
32-
into a file via the media storage provider and saves the local media
33-
metadata.
34-
5. If the media is an image:
28+
2. Checks if the URL matches an [oEmbed](https://oembed.com/) pattern. If it
29+
does, update the URL to download.
30+
3. Downloads the URL and stores it into a file via the media storage provider
31+
and saves the local media metadata.
32+
4. If the media is an image:
3533
1. Generates thumbnails.
3634
2. Generates an Open Graph response based on image properties.
37-
6. If the media is HTML:
35+
5. If the media is HTML:
3836
1. Decodes the HTML via the stored file.
3937
2. Generates an Open Graph response from the HTML.
4038
3. If an image exists in the Open Graph response:
4139
1. Downloads the URL and stores it into a file via the media storage
4240
provider and saves the local media metadata.
4341
2. Generates thumbnails.
4442
3. Updates the Open Graph response based on image properties.
43+
6. If the media is JSON and an oEmbed URL was found:
44+
1. Convert the oEmbed response to an Open Graph response.
45+
2. If a thumbnail or image is in the oEmbed response:
46+
1. Downloads the URL and stores it into a file via the media storage
47+
provider and saves the local media metadata.
48+
2. Generates thumbnails.
49+
3. Updates the Open Graph response based on image properties.
4550
7. Stores the result in the database cache.
4651
4. Returns the result.
4752
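
The branching in steps 4 to 6 of the list above can be sketched as a small dispatch on the downloaded media's content type. This is only an illustration under stated assumptions: `choose_preview_step` is a hypothetical helper, not a Synapse function, and the exact content types Synapse accepts for each branch are assumed here rather than taken from the source.

```python
from typing import Optional


def choose_preview_step(media_type: str, oembed_url: Optional[str]) -> str:
    """Return which branch of the preview flow would handle a download.

    Hypothetical helper for illustration only; it is not part of Synapse.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    base_type = media_type.split(";", 1)[0].strip().lower()

    # Step 4: images get thumbnails and an Open Graph response.
    if base_type.startswith("image/"):
        return "image"

    # Step 5: HTML is decoded and turned into an Open Graph response.
    if base_type == "text/html":
        return "html"

    # Step 6: JSON is only treated as oEmbed if an oEmbed URL was found.
    if oembed_url is not None and base_type == "application/json":
        return "oembed"

    return "unsupported"
```

Note that the JSON branch depends on two conditions, which matches the wording of step 6: a JSON body alone is not enough without a matching oEmbed pattern.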

synapse/rest/media/v1/oembed.py

Lines changed: 88 additions & 57 deletions
@@ -12,30 +12,30 @@
1212
# See the License for the specific language governing permissions and
1313
# limitations under the License.
1414
import logging
15+
import urllib.parse
1516
from typing import TYPE_CHECKING, Optional
1617

1718
import attr
1819

1920
from synapse.http.client import SimpleHttpClient
21+
from synapse.types import JsonDict
22+
from synapse.util import json_decoder
2023

2124
if TYPE_CHECKING:
2225
from synapse.server import HomeServer
2326

2427
logger = logging.getLogger(__name__)
2528

2629

27-
@attr.s(slots=True, auto_attribs=True)
30+
@attr.s(slots=True, frozen=True, auto_attribs=True)
2831
class OEmbedResult:
29-
# Either HTML content or URL must be provided.
30-
html: Optional[str]
31-
url: Optional[str]
32-
title: Optional[str]
33-
# Number of seconds to cache the content.
34-
cache_age: int
35-
36-
37-
class OEmbedError(Exception):
38-
"""An error occurred processing the oEmbed object."""
32+
# The Open Graph result (converted from the oEmbed result).
33+
open_graph_result: JsonDict
34+
# Number of seconds to cache the content, according to the oEmbed response.
35+
#
36+
# This will be None if no cache-age is provided in the oEmbed response (or
37+
# if the oEmbed response cannot be turned into an Open Graph response).
38+
cache_age: Optional[int]
3939

4040

4141
class OEmbedProvider:
@@ -81,75 +81,106 @@ def get_oembed_url(self, url: str) -> Optional[str]:
8181
"""
8282
for url_pattern, endpoint in self._oembed_patterns.items():
8383
if url_pattern.fullmatch(url):
84-
return endpoint
84+
# TODO Specify max height / width.
85+
86+
# Note that only the JSON format is supported, some endpoints want
87+
# this in the URL, others want it as an argument.
88+
endpoint = endpoint.replace("{format}", "json")
89+
90+
args = {"url": url, "format": "json"}
91+
query_str = urllib.parse.urlencode(args, True)
92+
return f"{endpoint}?{query_str}"
8593

8694
# No match.
8795
return None
8896

89-
async def get_oembed_content(self, endpoint: str, url: str) -> OEmbedResult:
97+
def parse_oembed_response(self, url: str, raw_body: bytes) -> OEmbedResult:
9098
"""
91-
Request content from an oEmbed endpoint.
99+
Parse the oEmbed response into an Open Graph response.
92100
93101
Args:
94-
endpoint: The oEmbed API endpoint.
95-
url: The URL to pass to the API.
102+
url: The URL which is being previewed (not the one which was
103+
requested).
104+
raw_body: The oEmbed response as JSON encoded as bytes.
96105
97106
Returns:
98-
An object representing the metadata returned.
99-
100-
Raises:
101-
OEmbedError if fetching or parsing of the oEmbed information fails.
107+
The Open Graph data converted from the oEmbed response, plus its cache age.
102108
"""
103-
try:
104-
logger.debug("Trying to get oEmbed content for url '%s'", url)
105109

106-
# Note that only the JSON format is supported, some endpoints want
107-
# this in the URL, others want it as an argument.
108-
endpoint = endpoint.replace("{format}", "json")
109-
110-
result = await self._client.get_json(
111-
endpoint,
112-
# TODO Specify max height / width.
113-
args={"url": url, "format": "json"},
114-
)
110+
try:
111+
# oEmbed responses *must* be UTF-8 according to the spec.
112+
oembed = json_decoder.decode(raw_body.decode("utf-8"))
115113

116114
# Ensure there's a version of 1.0.
117-
if result.get("version") != "1.0":
118-
raise OEmbedError("Invalid version: %s" % (result.get("version"),))
119-
120-
oembed_type = result.get("type")
115+
oembed_version = oembed["version"]
116+
if oembed_version != "1.0":
117+
raise RuntimeError(f"Invalid version: {oembed_version}")
121118

122119
# Ensure the cache age is None or an int.
123-
cache_age = result.get("cache_age")
120+
cache_age = oembed.get("cache_age")
124121
if cache_age:
125122
cache_age = int(cache_age)
126123

127-
oembed_result = OEmbedResult(None, None, result.get("title"), cache_age)
124+
# The results.
125+
open_graph_response = {"og:title": oembed.get("title")}
128126

129-
# HTML content.
127+
# If a thumbnail exists, use it. Note that dimensions will be calculated later.
128+
if "thumbnail_url" in oembed:
129+
open_graph_response["og:image"] = oembed["thumbnail_url"]
130+
131+
# Process each type separately.
132+
oembed_type = oembed["type"]
130133
if oembed_type == "rich":
131-
oembed_result.html = result.get("html")
132-
return oembed_result
134+
calc_description_and_urls(open_graph_response, oembed["html"])
133135

134-
if oembed_type == "photo":
135-
oembed_result.url = result.get("url")
136-
return oembed_result
136+
elif oembed_type == "photo":
137+
# If this is a photo, use the full image, not the thumbnail.
138+
open_graph_response["og:image"] = oembed["url"]
137139

138-
# TODO Handle link and video types.
140+
else:
141+
raise RuntimeError(f"Unknown oEmbed type: {oembed_type}")
139142

140-
if "thumbnail_url" in result:
141-
oembed_result.url = result.get("thumbnail_url")
142-
return oembed_result
143+
except Exception as e:
144+
# Trap any exception and let the code follow as usual.
145+
logger.warning(f"Error parsing oEmbed metadata from {url}: {e!r}")
146+
open_graph_response = {}
147+
cache_age = None
143148

144-
raise OEmbedError("Incompatible oEmbed information.")
149+
return OEmbedResult(open_graph_response, cache_age)
145150

146-
except OEmbedError as e:
147-
# Trap OEmbedErrors first so we can directly re-raise them.
148-
logger.warning("Error parsing oEmbed metadata from %s: %r", url, e)
149-
raise
150151

151-
except Exception as e:
152-
# Trap any exception and let the code follow as usual.
153-
# FIXME: pass through 404s and other error messages nicely
154-
logger.warning("Error downloading oEmbed metadata from %s: %r", url, e)
155-
raise OEmbedError() from e
152+
def calc_description_and_urls(open_graph_response: JsonDict, html_body: str) -> None:
153+
"""
154+
Calculate description for an HTML document.
155+
156+
This uses lxml to convert the HTML document into plaintext. If errors
157+
occur during processing of the document, an empty response is returned.
158+
159+
Args:
160+
open_graph_response: The current Open Graph summary. This is updated with additional fields.
161+
html_body: The HTML document, as a string.
162+
163+
Returns:
164+
None. The open_graph_response dict is updated in place.
165+
"""
166+
# If there's no body, nothing useful is going to be found.
167+
if not html_body:
168+
return
169+
170+
from lxml import etree
171+
172+
# Create an HTML parser. If this fails, log and return no metadata.
173+
parser = etree.HTMLParser(recover=True, encoding="utf-8")
174+
175+
# Attempt to parse the body. If this fails, log and return no metadata.
176+
tree = etree.fromstring(html_body, parser)
177+
178+
# The data was successfully parsed, but no tree was found.
179+
if tree is None:
180+
return
181+
182+
from synapse.rest.media.v1.preview_url_resource import _calc_description
183+
184+
description = _calc_description(tree)
185+
if description:
186+
open_graph_response["og:description"] = description
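
The conversion performed by `parse_oembed_response` can be approximated with only the standard library. The following is a minimal sketch, not the Synapse implementation: `oembed_to_open_graph` is a hypothetical name, it returns a plain tuple instead of an `OEmbedResult`, it uses `json.loads` in place of Synapse's `json_decoder`, and it skips the lxml-based description extraction for `rich` responses.

```python
import json
from typing import Any, Dict, Optional, Tuple


def oembed_to_open_graph(raw_body: bytes) -> Tuple[Dict[str, Any], Optional[int]]:
    """Convert an oEmbed JSON body to Open Graph fields (illustrative)."""
    try:
        # oEmbed responses must be UTF-8 according to the spec.
        oembed = json.loads(raw_body.decode("utf-8"))

        if oembed["version"] != "1.0":
            raise ValueError(f"Invalid version: {oembed['version']}")

        # The cache age may arrive as a string, so coerce it to an int.
        cache_age = oembed.get("cache_age")
        if cache_age:
            cache_age = int(cache_age)

        open_graph = {"og:title": oembed.get("title")}

        # Prefer a thumbnail if present; a photo overrides it below.
        if "thumbnail_url" in oembed:
            open_graph["og:image"] = oembed["thumbnail_url"]

        oembed_type = oembed["type"]
        if oembed_type == "photo":
            open_graph["og:image"] = oembed["url"]
        elif oembed_type != "rich":
            # "rich" is accepted but its HTML is not processed in this sketch.
            raise ValueError(f"Unknown oEmbed type: {oembed_type}")

    except Exception:
        # Any failure yields an empty result, as in the diff above.
        return {}, None

    return open_graph, cache_age
```

As in the diff, a malformed or unsupported response does not raise to the caller; it produces an empty Open Graph dict and no cache age, so the rest of the preview flow can continue.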
