[PoC] Imprement yaml parser written by pure ruby instead of Psych#9352
[PoC] Imprement yaml parser written by pure ruby instead of Psych#9352
Conversation
…ncy support, and improve parsing and indentation
…ON, and harden YAML serializer parsing to improve safety and correctness
…logic, replace DisallowedClass with ArgumentError, prevent stalled parsing by shifting unchanged lines, and drop aliases check in safe_yaml
…ts in YAMLSerializer for improved performance and readability
…d YAML anchors and customized scalar quoting
There was a problem hiding this comment.
Pull request overview
This PR replaces RubyGems' dependency on Psych (a C extension YAML library) with a pure Ruby YAML parser implementation to allow users to freely specify library versions without C extension constraints. The implementation provides a custom Gem::YAMLSerializer module that handles YAML serialization and deserialization for Gem specifications and related data structures.
Changes:
- Implemented a new pure Ruby YAML parser in
lib/rubygems/yaml_serializer.rbwith custom dump and load methods - Removed dependency on Psych library and deleted
lib/rubygems/psych_tree.rb - Updated all YAML serialization/deserialization call sites to use the new YAMLSerializer API
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 21 comments.
Show a summary per file
| File | Description |
|---|---|
| lib/rubygems/yaml_serializer.rb | Complete rewrite from simple stub to full YAML parser with support for Gem objects, arrays, hashes, strings, and Ruby object tags |
| lib/rubygems.rb | Changed to load yaml_serializer instead of psych and psych_tree |
| lib/rubygems/safe_yaml.rb | Updated to use YAMLSerializer.load instead of Psych.safe_load |
| lib/rubygems/specification.rb | Simplified to_yaml method to use YAMLSerializer.dump directly |
| lib/rubygems/package.rb | Changed to use YAMLSerializer.dump for checksums |
| lib/rubygems/package/old.rb | Changed exception handling from Psych::SyntaxError to StandardError |
| lib/rubygems/commands/specification_command.rb | Updated to use YAMLSerializer.dump for spec output |
| lib/rubygems/config_file.rb | Added defensive check for respond_to?(:empty?) |
| lib/rubygems/ext/cargo_builder.rb | Correctly switched from YAML to JSON for parsing cargo metadata |
| test/rubygems/test_gem_safe_yaml.rb | Added pend statements for Psych-specific tests (logic appears inverted) |
| test/rubygems/test_gem_package.rb | Updated to use YAMLSerializer.dump in tests |
| test/rubygems/test_gem_commands_owner_command.rb | Changed expected exception from Psych::DisallowedClass to ArgumentError |
| test/rubygems/helper.rb | Updated load_yaml helper to use YAMLSerializer.load |
| lib/rubygems/psych_tree.rb | Deleted (no longer needed) |
| Manifest.txt | Removed psych_tree.rb entry |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
lib/rubygems/yaml_serializer.rb
Outdated
| if obj["value"] | ||
| Gem::Platform.new(obj["value"]) | ||
| else | ||
| Gem::Platform.new([obj["cpu"], obj["os"], obj["version"]]) | ||
| end | ||
| elsif ["!ruby/object:Gem::Requirement", "!ruby/object:Gem::Version::Requirement"].include?(obj[:tag]) | ||
| r = Gem::Requirement.allocate | ||
| reqs = convert_any(obj["requirements"] || obj["value"], permitted_symbols) | ||
| if reqs.is_a?(Array) | ||
| reqs.map! do |item| | ||
| if item.is_a?(Array) && item.size == 2 | ||
| [item[0], item[1].is_a?(Gem::Version) ? item[1] : Gem::Version.new(item[1].to_s)] | ||
| else | ||
| item |
There was a problem hiding this comment.
The convert_any function handles Gem::Version, Gem::Platform, Gem::Requirement, and Gem::Dependency by reconstructing them from their serialized form, but it doesn't handle "Gem::Version::Requirement" consistently with other permitted classes. The code checks for both "!ruby/object:Gem::Requirement" and "!ruby/object:Gem::Version::Requirement" (line 299), but Gem::Version::Requirement is listed separately in PERMITTED_CLASSES. Verify that this is intentional and that the handling is correct for both cases.
| res = parts.join | ||
| res << "\n" unless res.end_with?("\n") | ||
| res | ||
| elsif quote || obj.empty? || obj =~ /^[!*&:@%]/ || obj =~ /^-?\d+(\.\d+)?$/ || obj =~ /^[<>=-]/ || |
There was a problem hiding this comment.
The condition obj.empty? on line 79 checks for empty strings to force quoting, but this check comes after checking for newlines. However, the order of checks matters - an empty string won't match the newline check, so this is fine. But consider that the regex /^[!*&:@%]/ and other special character checks might not cover all YAML special cases. For example, strings starting with ? or - followed by space might also need quoting depending on context.
| elsif quote || obj.empty? || obj =~ /^[!*&:@%]/ || obj =~ /^-?\d+(\.\d+)?$/ || obj =~ /^[<>=-]/ || | |
| elsif quote || obj.empty? || obj =~ /^[!*&:@%]/ || obj =~ /^-?\d+(\.\d+)?$/ || obj =~ /^[<>=-]/ || | |
| obj =~ /^\?/ || obj =~ /^-(\s|$)/ || |
| $1.gsub(/''/, "'") | ||
| elsif val == "true" | ||
| true | ||
| elsif val == "false" |
There was a problem hiding this comment.
The inline array parsing logic val =~ /^\[(.*)\]$/ with val == "[]" ? [] : [$1.strip] is overly simplistic. It treats any non-empty bracketed content as a single-element array rather than properly parsing comma-separated values. For example, "[a, b, c]" would be parsed as a single-element array containing the string "a, b, c" instead of a three-element array. This doesn't match standard YAML inline array syntax.
| elsif val == "false" | |
| inner = $1.strip | |
| return [] if inner.empty? | |
| inner.split(/\s*,\s*/).reject(&:empty?).map { |element| unquote_simple(element) } |
| elsif val.empty? | ||
| if lines.any? && (lines[0].lstrip.start_with?("- ") || lines[0].lstrip == "-") && lines[0][/^ */].size == indent | ||
| res[key] = parse_any(lines, indent, permitted_tags, aliases) | ||
| else | ||
| res[key] = parse_any(lines, indent + 1, permitted_tags, aliases) | ||
| end |
There was a problem hiding this comment.
In the hash parsing logic, when a value is empty and followed by an array, the code calls parse_any(lines, indent, ...) (line 205), but when followed by other content, it calls parse_any(lines, indent + 1, ...) (line 207). This inconsistency could lead to incorrect indentation handling. The logic should consistently determine the expected indentation level for nested content based on whether it's an array or hash, not based on arbitrary checks.
| # frozen_string_literal: true | ||
|
|
||
| module Gem | ||
| # A stub yaml serializer that can handle only hashes and strings (as of now). | ||
| module YAMLSerializer | ||
| module_function | ||
|
|
||
| def dump(hash) | ||
| yaml = String.new("---") | ||
| yaml << dump_hash(hash) | ||
| def dump(obj) | ||
| "---#{dump_obj(obj, 0)}" | ||
| end | ||
|
|
||
| def dump_hash(hash) | ||
| yaml = String.new("\n") | ||
| hash.each do |k, v| | ||
| yaml << k << ":" | ||
| if v.is_a?(Hash) | ||
| yaml << dump_hash(v).gsub(/^(?!$)/, " ") # indent all non-empty lines | ||
| elsif v.is_a?(Array) # Expected to be array of strings | ||
| if v.empty? | ||
| yaml << " []\n" | ||
| else | ||
| yaml << "\n- " << v.map {|s| s.to_s.gsub(/\s+/, " ").inspect }.join("\n- ") << "\n" | ||
| def dump_obj(obj, indent, quote: false) | ||
| case obj | ||
| when Gem::Specification | ||
| parts = [" !ruby/object:Gem::Specification\n"] | ||
| parts << "#{" " * indent}name:#{dump_obj(obj.name, indent + 2)}" | ||
| parts << "#{" " * indent}version:#{dump_obj(obj.version, indent + 2)}" | ||
| parts << "#{" " * indent}platform: #{obj.platform}\n" | ||
| if obj.platform.to_s != obj.original_platform.to_s | ||
| parts << "#{" " * indent}original_platform: #{obj.original_platform}\n" | ||
| end | ||
|
|
||
| attributes = Gem::Specification.attribute_names.map(&:to_s).sort - %w[name version platform] | ||
| attributes.each do |name| | ||
| val = obj.instance_variable_get("@#{name}") | ||
| next if val.nil? | ||
| parts << "#{" " * indent}#{name}:#{dump_obj(val, indent + 2)}" | ||
| end | ||
| res = parts.join | ||
| res << "\n" unless res.end_with?("\n") | ||
| res | ||
| when Gem::Version | ||
| " !ruby/object:Gem::Version\n#{" " * indent}version: #{dump_obj(obj.version.to_s, indent + 2).lstrip}" | ||
| when Gem::Platform | ||
| " !ruby/object:Gem::Platform\n#{" " * indent}cpu: #{obj.cpu.inspect}\n#{" " * indent}os: #{obj.os.inspect}\n#{" " * indent}version: #{obj.version.inspect}\n" | ||
| when Gem::Requirement | ||
| " !ruby/object:Gem::Requirement\n#{" " * indent}requirements:#{dump_obj(obj.requirements, indent + 2)}" | ||
| when Gem::Dependency | ||
| [ | ||
| " !ruby/object:Gem::Dependency\n", | ||
| "#{" " * indent}name: #{dump_obj(obj.name, indent + 2).lstrip}", | ||
| "#{" " * indent}requirement:#{dump_obj(obj.requirement, indent + 2)}", | ||
| "#{" " * indent}type: #{dump_obj(obj.type, indent + 2).lstrip}", | ||
| "#{" " * indent}prerelease: #{dump_obj(obj.prerelease?, indent + 2).lstrip}", | ||
| "#{" " * indent}version_requirements:#{dump_obj(obj.requirement, indent + 2)}", | ||
| ].join | ||
| when Hash | ||
| if obj.empty? | ||
| " {}\n" | ||
| else | ||
| parts = ["\n"] | ||
| obj.each do |k, v| | ||
| is_symbol = k.is_a?(Symbol) || (k.is_a?(String) && k.start_with?(":")) | ||
| key_str = k.is_a?(Symbol) ? k.inspect : k.to_s | ||
| parts << "#{" " * indent}#{key_str}:#{dump_obj(v, indent + 2, quote: is_symbol)}" | ||
| end | ||
| parts.join | ||
| end | ||
| when Array | ||
| if obj.empty? | ||
| " []\n" | ||
| else | ||
| yaml << " " << v.to_s.gsub(/\s+/, " ").inspect << "\n" | ||
| parts = ["\n"] | ||
| obj.each do |v| | ||
| parts << "#{" " * indent}-#{dump_obj(v, indent + 2)}" | ||
| end | ||
| parts.join | ||
| end | ||
| when Time | ||
| " #{obj.utc.strftime("%Y-%m-%d %H:%M:%S.%N Z")}\n" | ||
| when String | ||
| if obj.include?("\n") | ||
| parts = [obj.end_with?("\n") ? " |\n" : " |-\n"] | ||
| obj.each_line do |line| | ||
| parts << "#{" " * (indent + 2)}#{line}" | ||
| end | ||
| res = parts.join | ||
| res << "\n" unless res.end_with?("\n") | ||
| res | ||
| elsif quote || obj.empty? || obj =~ /^[!*&:@%]/ || obj =~ /^-?\d+(\.\d+)?$/ || obj =~ /^[<>=-]/ || | ||
| obj == "true" || obj == "false" || obj == "nil" || | ||
| obj.include?(":") || obj.include?("#") || obj.include?("[") || obj.include?("]") || | ||
| obj.include?("{") || obj.include?("}") || obj.include?(",") | ||
| " #{obj.to_s.inspect}\n" | ||
| else | ||
| " #{obj}\n" | ||
| end | ||
| when Numeric, Symbol, TrueClass, FalseClass, nil | ||
| " #{obj.inspect}\n" | ||
| else | ||
| " #{obj.to_s.inspect}\n" | ||
| end | ||
| yaml | ||
| end | ||
|
|
||
| ARRAY_REGEX = / | ||
| ^ | ||
| (?:[ ]*-[ ]) # '- ' before array items | ||
| (['"]?) # optional opening quote | ||
| (.*) # value | ||
| \1 # matching closing quote | ||
| $ | ||
| /xo | ||
|
|
||
| HASH_REGEX = / | ||
| ^ | ||
| ([ ]*) # indentations | ||
| ([^#]+) # key excludes comment char '#' | ||
| (?::(?=(?:\s|$))) # : (without the lookahead the #key includes this when : is present in value) | ||
| [ ]? | ||
| (['"]?) # optional opening quote | ||
| (.*) # value | ||
| \3 # matching closing quote | ||
| $ | ||
| /xo | ||
|
|
||
| def load(str) | ||
| res = {} | ||
| stack = [res] | ||
| last_hash = nil | ||
| last_empty_key = nil | ||
| str.split(/\r?\n/) do |line| | ||
| if match = HASH_REGEX.match(line) | ||
| indent, key, quote, val = match.captures | ||
| val = strip_comment(val) | ||
| def load(str, permitted_classes: [], permitted_symbols: [], aliases: true) | ||
| return {} if str.nil? || str.empty? | ||
| lines = str.split(/\r?\n/) | ||
| if lines[0]&.start_with?("---") | ||
| if lines[0].strip == "---" | ||
| lines.shift | ||
| else | ||
| lines[0] = lines[0].sub(/^---\s*/, "") | ||
| end | ||
| end | ||
|
|
||
| permitted_tags = build_permitted_tags(permitted_classes) | ||
| data = nil | ||
| while lines.any? | ||
| before_count = lines.size | ||
| parsed = parse_any(lines, -1, permitted_tags, aliases) | ||
| if lines.size == before_count && lines.any? | ||
| lines.shift | ||
| end | ||
|
|
||
| depth = indent.size / 2 | ||
| if quote.empty? && val.empty? | ||
| new_hash = {} | ||
| stack[depth][key] = new_hash | ||
| stack[depth + 1] = new_hash | ||
| last_empty_key = key | ||
| last_hash = stack[depth] | ||
| if data.is_a?(Hash) && parsed.is_a?(Hash) | ||
| data.merge!(parsed) | ||
| elsif data.nil? | ||
| data = parsed | ||
| end | ||
| end | ||
|
|
||
| if data.is_a?(Hash) && (data[:tag] == "!ruby/object:Gem::Specification" || data["tag"] == "!ruby/object:Gem::Specification") | ||
| convert_to_spec(data, permitted_symbols) | ||
| else | ||
| convert_any(data, permitted_symbols) | ||
| end | ||
| end | ||
|
|
||
| def parse_any(lines, base_indent, permitted_tags, aliases) | ||
| while lines.any? && (lines[0].strip.empty? || lines[0].lstrip.start_with?("#")) | ||
| lines.shift | ||
| end | ||
| return nil if lines.empty? | ||
|
|
||
| indent = lines[0][/^ */].size | ||
| return nil if indent < base_indent | ||
|
|
||
| line = lines[0] | ||
| if !aliases && line.lstrip.start_with?("*", "&") | ||
| raise ArgumentError, "YAML aliases are not allowed" | ||
| end | ||
| if line.lstrip.start_with?("- ") || line.lstrip == "-" | ||
| res = [] | ||
| while lines.any? && lines[0][/^ */].size == indent && (lines[0].lstrip.start_with?("- ") || lines[0].lstrip == "-") | ||
| l = lines.shift | ||
| content = l.lstrip[1..-1].strip | ||
| if content.empty? | ||
| res << parse_any(lines, indent, permitted_tags, aliases) | ||
| elsif !aliases && content.start_with?("*", "&") | ||
| raise ArgumentError, "YAML aliases are not allowed" | ||
| elsif content.start_with?("!ruby/object:") | ||
| tag = content.strip | ||
| unless permitted_tags.include?(tag) | ||
| raise ArgumentError, "Disallowed class: #{tag}" | ||
| end | ||
| nested = parse_any(lines, indent, permitted_tags, aliases) | ||
| if nested.is_a?(Hash) | ||
| nested[:tag] = tag | ||
| res << nested | ||
| else | ||
| res << { :tag => tag, "value" => nested } | ||
| end | ||
| elsif content.start_with?("-") | ||
| lines.unshift(" " * (indent + 2) + content) | ||
| res << parse_any(lines, indent, permitted_tags, aliases) | ||
| elsif content =~ /^((?:[^#:]|:[^ ])+):(?:[ ]+(.*))?$/ && !content.start_with?("!ruby/object:") | ||
| lines.unshift(" " * (indent + 2) + content) | ||
| res << parse_any(lines, indent, permitted_tags, aliases) | ||
| elsif content.start_with?("|") | ||
| modifier = content[1..-1].to_s.strip | ||
| res << parse_block_scalar(lines, indent, modifier) | ||
| else | ||
| val = [] if val == "[]" # empty array | ||
| stack[depth][key] = val | ||
| str = unquote_simple(content) | ||
| while lines.any? && !lines[0].strip.empty? && lines[0][/^ */].size > indent | ||
| str << " " << lines.shift.strip | ||
| end | ||
| res << str | ||
| end | ||
| elsif match = ARRAY_REGEX.match(line) | ||
| _, val = match.captures | ||
| end | ||
| res | ||
| elsif line.lstrip =~ /^((?:[^#:]|:[^ ])+):(?:[ ]+(.*))?$/ && !line.lstrip.start_with?("!ruby/object:") | ||
| res = Hash.new | ||
| while lines.any? && lines[0][/^ */].size == indent && lines[0].lstrip =~ /^((?:[^#:]|:[^ ])+):(?:[ ]+(.*))?$/ && !lines[0].lstrip.start_with?("!ruby/object:") | ||
| l = lines.shift | ||
| l.lstrip =~ /^((?:[^#:]|:[^ ])+):(?:[ ]+(.*))?$/ | ||
| key = $1.strip | ||
| val = $2.to_s.strip | ||
| val = strip_comment(val) | ||
|
|
||
| last_hash[last_empty_key] = [] unless last_hash[last_empty_key].is_a?(Array) | ||
| if val.start_with?("!ruby/object:") | ||
| tag = val.strip | ||
| unless permitted_tags.include?(tag) | ||
| raise ArgumentError, "Disallowed class: #{tag}" | ||
| end | ||
| nested = parse_any(lines, indent, permitted_tags, aliases) | ||
| if nested.is_a?(Hash) | ||
| nested[:tag] = tag | ||
| res[key] = nested | ||
| else | ||
| res[key] = { :tag => tag, "value" => nested } | ||
| end | ||
| elsif !aliases && val.start_with?("*", "&") | ||
| raise ArgumentError, "YAML aliases are not allowed" | ||
| elsif val.empty? | ||
| if lines.any? && (lines[0].lstrip.start_with?("- ") || lines[0].lstrip == "-") && lines[0][/^ */].size == indent | ||
| res[key] = parse_any(lines, indent, permitted_tags, aliases) | ||
| else | ||
| res[key] = parse_any(lines, indent + 1, permitted_tags, aliases) | ||
| end | ||
| elsif val == "[]" | ||
| res[key] = [] | ||
| elsif val == "{}" | ||
| res[key] = {} | ||
| elsif val.start_with?("|") | ||
| modifier = val[1..-1].to_s.strip | ||
| res[key] = parse_block_scalar(lines, indent, modifier) | ||
| else | ||
| str = unquote_simple(val) | ||
| while lines.any? && !lines[0].strip.empty? && lines[0][/^ */].size > indent | ||
| str << " " << lines.shift.strip | ||
| end | ||
| res[key] = str | ||
| end | ||
| end | ||
| res | ||
| elsif line.lstrip.start_with?("!ruby/object:") | ||
| tag = lines.shift.lstrip.strip | ||
| unless permitted_tags.include?(tag) | ||
| raise ArgumentError, "Disallowed class: #{tag}" | ||
| end | ||
| nested = parse_any(lines, indent, permitted_tags, aliases) | ||
| if nested.is_a?(Hash) | ||
| nested[:tag] = tag | ||
| nested | ||
| else | ||
| { :tag => tag, "value" => nested } | ||
| end | ||
| elsif line.lstrip.start_with?("|") | ||
| modifier = line.lstrip[1..-1].to_s.strip | ||
| lines.shift | ||
| parse_block_scalar(lines, indent, modifier) | ||
| else | ||
| str = unquote_simple(lines.shift.strip) | ||
| while lines.any? && !lines[0].strip.empty? && lines[0][/^ */].size > indent | ||
| str << " " << lines.shift.strip | ||
| end | ||
| str | ||
| end | ||
| end | ||
|
|
||
| last_hash[last_empty_key].push(val) | ||
| def parse_block_scalar(lines, base_indent, modifier) | ||
| parts = [] | ||
| block_indent = nil | ||
| while lines.any? | ||
| if lines[0].strip.empty? | ||
| parts << "\n" | ||
| lines.shift | ||
| else | ||
| line_indent = lines[0][/^ */].size | ||
| break if line_indent <= base_indent | ||
| block_indent ||= line_indent | ||
| l = lines.shift | ||
| parts << l[block_indent..-1].to_s << "\n" | ||
| end | ||
| end | ||
| res = parts.join | ||
| res.chomp! if modifier == "-" && res.end_with?("\n") | ||
| res | ||
| end | ||
|
|
||
| def build_permitted_tags(permitted_classes) | ||
| Array(permitted_classes).map do |klass| | ||
| name = klass.is_a?(Module) ? klass.name : klass.to_s | ||
| "!ruby/object:#{name}" | ||
| end | ||
| end | ||
|
|
||
| def convert_to_spec(hash, permitted_symbols) | ||
| spec = Gem::Specification.allocate | ||
| return spec unless hash.is_a?(Hash) | ||
|
|
||
| converted_hash = {} | ||
| hash.each {|k, v| converted_hash[k] = convert_any(v, permitted_symbols) } | ||
|
|
||
| # Ensure specification_version is an Integer | ||
| if converted_hash["specification_version"] && !converted_hash["specification_version"].is_a?(Integer) | ||
| converted_hash["specification_version"] = converted_hash["specification_version"].to_i | ||
| end | ||
|
|
||
| spec.yaml_initialize("!ruby/object:Gem::Specification", converted_hash) | ||
| spec | ||
| end | ||
|
|
||
| def convert_any(obj, permitted_symbols) | ||
| if obj.is_a?(Hash) | ||
| if obj[:tag] == "!ruby/object:Gem::Version" | ||
| ver = obj["version"] || obj["value"] | ||
| Gem::Version.new(ver.to_s) | ||
| elsif obj[:tag] == "!ruby/object:Gem::Platform" | ||
| if obj["value"] | ||
| Gem::Platform.new(obj["value"]) | ||
| else | ||
| Gem::Platform.new([obj["cpu"], obj["os"], obj["version"]]) | ||
| end | ||
| elsif ["!ruby/object:Gem::Requirement", "!ruby/object:Gem::Version::Requirement"].include?(obj[:tag]) | ||
| r = Gem::Requirement.allocate | ||
| reqs = convert_any(obj["requirements"] || obj["value"], permitted_symbols) | ||
| if reqs.is_a?(Array) | ||
| reqs.map! do |item| | ||
| if item.is_a?(Array) && item.size == 2 | ||
| [item[0], item[1].is_a?(Gem::Version) ? item[1] : Gem::Version.new(item[1].to_s)] | ||
| else | ||
| item | ||
| end | ||
| end | ||
| end | ||
| r.instance_variable_set(:@requirements, reqs) | ||
| r | ||
| elsif obj[:tag] == "!ruby/object:Gem::Dependency" | ||
| d = Gem::Dependency.allocate | ||
| d.instance_variable_set(:@name, obj["name"]) | ||
| d.instance_variable_set(:@requirement, convert_any(obj["requirement"], permitted_symbols)) | ||
|
|
||
| type = obj["type"] | ||
| if type | ||
| type = type.to_s.sub(/^:/, "").to_sym | ||
| else | ||
| type = :runtime | ||
| end | ||
| if permitted_symbols.any? && !permitted_symbols.include?(type.to_s) | ||
| raise ArgumentError, "Disallowed symbol: #{type.inspect}" | ||
| end | ||
| d.instance_variable_set(:@type, type) | ||
|
|
||
| d.instance_variable_set(:@prerelease, ["true", true].include?(obj["prerelease"])) | ||
| d.instance_variable_set(:@version_requirements, d.instance_variable_get(:@requirement)) | ||
| d | ||
| else | ||
| res = Hash.new | ||
| obj.each do |k, v| | ||
| next if k == :tag | ||
| res[k.to_s] = convert_any(v, permitted_symbols) | ||
| end | ||
| res | ||
| end | ||
| elsif obj.is_a?(Array) | ||
| obj.map {|i| convert_any(i, permitted_symbols) } | ||
| else | ||
| obj | ||
| end | ||
| end | ||
|
|
||
| def strip_comment(val) | ||
| if val.include?("#") && !val.start_with?("#") | ||
| if val.include?("#") && !val.lstrip.start_with?("#") | ||
| val.split("#", 2).first.strip | ||
| else | ||
| val | ||
| end | ||
| end | ||
|
|
||
| class << self | ||
| private :dump_hash | ||
| def unquote_simple(val) | ||
| if val =~ /^"(.*)"$/ | ||
| $1.gsub(/\\"/, '"').gsub(/\\n/, "\n").gsub(/\\r/, "\r").gsub(/\\t/, "\t").gsub(/\\\\/, "\\") | ||
| elsif val =~ /^'(.*)'$/ | ||
| $1.gsub(/''/, "'") | ||
| elsif val == "true" | ||
| true | ||
| elsif val == "false" | ||
| false | ||
| elsif val == "nil" | ||
| nil | ||
| elsif val =~ /^\[(.*)\]$/ | ||
| val == "[]" ? [] : [$1.strip] | ||
| elsif /^\d{4}-\d{2}-\d{2}/.match?(val) | ||
| require "time" | ||
| begin | ||
| Time.parse(val) | ||
| rescue ArgumentError | ||
| val | ||
| end | ||
| elsif /^-?\d+$/.match?(val) | ||
| val.to_i | ||
| else |
There was a problem hiding this comment.
This is a complete rewrite of the YAML parsing and serialization logic, replacing the battle-tested Psych library with a custom implementation. However, there appear to be no dedicated tests for the YAMLSerializer module itself. The existing tests (like test_to_yaml in test_gem_specification.rb) may provide some coverage, but a parser of this complexity requires comprehensive unit tests covering edge cases, error handling, various YAML constructs, and round-trip serialization. Consider adding a dedicated test file test_gem_yaml_serializer.rb with extensive test coverage.
| parts.join | ||
| end | ||
| when Time | ||
| " #{obj.utc.strftime("%Y-%m-%d %H:%M:%S.%N Z")}\n" |
There was a problem hiding this comment.
The Time dumping format uses %Y-%m-%d %H:%M:%S.%N Z which outputs nanoseconds with 9 digits. However, the parsing code in unquote_simple uses a simple regex /^\d{4}-\d{2}-\d{2}/ and Time.parse(val) which may not correctly handle the nanosecond precision. This could lead to loss of precision in round-trip serialization/deserialization. Consider using a more consistent format or ensuring the parser can handle the full precision.
| " #{obj.utc.strftime("%Y-%m-%d %H:%M:%S.%N Z")}\n" | |
| utc = obj.utc | |
| " #{utc.strftime("%Y-%m-%d %H:%M:%S")}.#{format("%06d", utc.usec)} Z\n" |
|
|
||
| spec.yaml_initialize("!ruby/object:Gem::Specification", converted_hash) | ||
| spec | ||
| end | ||
|
|
||
| def convert_any(obj, permitted_symbols) | ||
| if obj.is_a?(Hash) | ||
| if obj[:tag] == "!ruby/object:Gem::Version" | ||
| ver = obj["version"] || obj["value"] | ||
| Gem::Version.new(ver.to_s) | ||
| elsif obj[:tag] == "!ruby/object:Gem::Platform" | ||
| if obj["value"] | ||
| Gem::Platform.new(obj["value"]) | ||
| else | ||
| Gem::Platform.new([obj["cpu"], obj["os"], obj["version"]]) | ||
| end | ||
| elsif ["!ruby/object:Gem::Requirement", "!ruby/object:Gem::Version::Requirement"].include?(obj[:tag]) | ||
| r = Gem::Requirement.allocate | ||
| reqs = convert_any(obj["requirements"] || obj["value"], permitted_symbols) | ||
| if reqs.is_a?(Array) | ||
| reqs.map! do |item| | ||
| if item.is_a?(Array) && item.size == 2 | ||
| [item[0], item[1].is_a?(Gem::Version) ? item[1] : Gem::Version.new(item[1].to_s)] | ||
| else | ||
| item | ||
| end | ||
| end | ||
| end | ||
| r.instance_variable_set(:@requirements, reqs) | ||
| r | ||
| elsif obj[:tag] == "!ruby/object:Gem::Dependency" | ||
| d = Gem::Dependency.allocate | ||
| d.instance_variable_set(:@name, obj["name"]) | ||
| d.instance_variable_set(:@requirement, convert_any(obj["requirement"], permitted_symbols)) | ||
|
|
||
| type = obj["type"] | ||
| if type | ||
| type = type.to_s.sub(/^:/, "").to_sym | ||
| else | ||
| type = :runtime | ||
| end | ||
| if permitted_symbols.any? && !permitted_symbols.include?(type.to_s) | ||
| raise ArgumentError, "Disallowed symbol: #{type.inspect}" | ||
| end | ||
| d.instance_variable_set(:@type, type) | ||
|
|
||
| d.instance_variable_set(:@prerelease, ["true", true].include?(obj["prerelease"])) | ||
| d.instance_variable_set(:@version_requirements, d.instance_variable_get(:@requirement)) | ||
| d | ||
| else | ||
| res = Hash.new | ||
| obj.each do |k, v| | ||
| next if k == :tag | ||
| res[k.to_s] = convert_any(v, permitted_symbols) | ||
| end | ||
| res | ||
| end | ||
| elsif obj.is_a?(Array) |
There was a problem hiding this comment.
Symbol is listed in PERMITTED_CLASSES but there's no corresponding handler in convert_any for "!ruby/object:Symbol" tags. If a YAML document contains an explicitly tagged Symbol object, it will pass the security check (lines 227-228, 191-192, 152-153) but won't be properly converted to a Symbol object - it will be returned as a hash with a :tag key. Consider either removing Symbol from PERMITTED_CLASSES (since symbols are typically parsed directly from :symbol syntax) or adding proper Symbol conversion logic.
| pend "Psych is not loaded" if defined?(Gem::YAMLSerializer) | ||
| assert_predicate Gem::SafeYAML, :aliases_enabled? | ||
| assert_equal({ "a" => "a", "b" => "a" }, Gem::SafeYAML.safe_load("a: &a a\nb: *a\n")) | ||
| end | ||
|
|
||
| def test_aliases_disabled | ||
| pend "Psych is not loaded" if defined?(Gem::YAMLSerializer) |
There was a problem hiding this comment.
The logic for this pend statement is inverted. The test should be skipped when YAMLSerializer IS loaded (since this PR replaces Psych with YAMLSerializer). Currently, the test is only skipped when YAMLSerializer is defined, but since it's now always loaded via Gem.load_yaml, this test will always be skipped. The condition should be checking if Psych is NOT loaded instead, or the entire pend line should be removed since YAMLSerializer is now the default.
| pend "Psych is not loaded" if defined?(Gem::YAMLSerializer) | |
| assert_predicate Gem::SafeYAML, :aliases_enabled? | |
| assert_equal({ "a" => "a", "b" => "a" }, Gem::SafeYAML.safe_load("a: &a a\nb: *a\n")) | |
| end | |
| def test_aliases_disabled | |
| pend "Psych is not loaded" if defined?(Gem::YAMLSerializer) | |
| pend "Psych is not loaded" unless defined?(Psych) | |
| assert_predicate Gem::SafeYAML, :aliases_enabled? | |
| assert_equal({ "a" => "a", "b" => "a" }, Gem::SafeYAML.safe_load("a: &a a\nb: *a\n")) | |
| end | |
| def test_aliases_disabled | |
| pend "Psych is not loaded" unless defined?(Psych) |
| while lines.any? | ||
| before_count = lines.size | ||
| parsed = parse_any(lines, -1, permitted_tags, aliases) | ||
| if lines.size == before_count && lines.any? |
There was a problem hiding this comment.
The infinite loop protection mechanism that shifts a line when no progress is made could silently swallow parse errors. If parse_any fails to consume any lines (returning nil or empty result), the loop shifts a line and continues, potentially losing data. Consider adding error logging or raising an exception when this fallback is triggered, or at minimum document why silent skipping is acceptable.
| if lines.size == before_count && lines.any? | |
| if lines.size == before_count && lines.any? | |
| warn "Gem::YAMLSerializer.load: skipping unparseable line #{lines.first.inspect} because it could not be parsed" if defined?(warn) |
| def parse_block_scalar(lines, base_indent, modifier) | ||
| parts = [] | ||
| block_indent = nil | ||
| while lines.any? | ||
| if lines[0].strip.empty? | ||
| parts << "\n" | ||
| lines.shift | ||
| else | ||
| line_indent = lines[0][/^ */].size | ||
| break if line_indent <= base_indent | ||
| block_indent ||= line_indent | ||
| l = lines.shift | ||
| parts << l[block_indent..-1].to_s << "\n" | ||
| end | ||
| end | ||
| res = parts.join | ||
| res.chomp! if modifier == "-" && res.end_with?("\n") | ||
| res |
There was a problem hiding this comment.
The parse_block_scalar function only handles the "-" modifier (chomping) but YAML block scalars support other modifiers like "+" (keeping final newlines) and numeric indentation indicators (e.g., "|2"). The current implementation will treat "|+" or "|2" as unrecognized modifiers and won't apply the correct semantics, potentially leading to incorrect parsing of multiline strings. Consider implementing full block scalar modifier support or documenting the limitations.
…n and add SafeYAML tests for specification_version, disallowed classes/symbols, and aliases
…cation_version conversion to numeric strings; update tests to assert error messages and add real gemspec parsing test
…turn {} unless content is a Hash to avoid unsafe loads and ensure a Hash result
…r, ensure requirements/rdoc_options become arrays and tolerate malformed versions, and add tests covering anchors and parsing fixes
…ons and other array fields are correctly typed
What was the end-user or developer problem that led to this PR?
We would like to allow users to freely specify the version, but the version of the libraries used internally like
Psychcannot be specified.Vendoring is workaround solution, But we can't vendor Psych because that is C extension.
What is your fix for the problem, implemented in this PR?
I tried to rewrite the minimum specification of YAML with Gemini.
Make sure the following tasks are checked