Задача 1

Naksen · web-flow · commit 5fb48ba75850 · 2022-01-15T02:08:15.000+05:00
diff --git a/jupyter/Math_and_Python_for_Data_analysis/week_2/sentences.txt b/jupyter/Math_and_Python_for_Data_analysis/week_2/sentences.txt
@@ -0,0 +1,22 @@
+In comparison to dogs, cats have not undergone major changes during the domestication process.
+As cat simply catenates streams of bytes, it can be also used to concatenate binary files, where it will just concatenate sequence of bytes.
+A common interactive use of cat for a single file is to output the content of a file to standard output.
+Cats can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other small animals.
+In one, people deliberately tamed cats in a process of artificial selection, as they were useful predators of vermin.
+The domesticated cat and its closest wild ancestor are both diploid organisms that possess 38 chromosomes and roughly 20,000 genes.
+Domestic cats are similar in size to the other members of the genus Felis, typically weighing between 4 and 5 kg (8.8 and 11.0 lb).
+However, if the output is piped or redirected, cat is unnecessary.
+cat with one named file is safer where human error is a concern - one wrong use of the default redirection symbol ">" instead of "<" (often adjacent on keyboards) may permanently delete the file you were just needing to read.
+In terms of legibility, a sequence of commands starting with cat and connected by pipes has a clear left-to-right flow of information.
+Cat command is one of the basic commands that you learned when you started in the Unix / Linux world.
+Using cat command, the lines received from stdin can be redirected to a new file using redirection symbols.
+When you type simply cat command without any arguments, it just receives the stdin content and displays it in the stdout.
+Leopard was released on October 26, 2007 as the successor of Tiger (version 10.4), and is available in two editions.
+According to Apple, Leopard contains over 300 changes and enhancements over its predecessor, Mac OS X Tiger.
+As of Mid 2010, some Apple computers have firmware factory installed which will no longer allow installation of Mac OS X Leopard.
+Since Apple moved to using Intel processors in their computers, the OSx86 community has developed and now also allows Mac OS X Tiger and later releases to be installed on non-Apple x86-based computers.
+OS X Mountain Lion was released on July 25, 2012 for purchase and download through Apple's Mac App Store, as part of a switch to releasing OS X versions online and every year.
+Apple has released a small patch for the three most recent versions of Safari running on OS X Yosemite, Mavericks, and Mountain Lion.
+The Mountain Lion release marks the second time Apple has offered an incremental upgrade, rather than releasing a new cat entirely.
+Mac OS X Mountain Lion installs in place, so you won't need to create a separate disk or run the installation off an external drive.
+The fifth major update to Mac OS X, Leopard, contains such a mountain of features - more than 300 by Apple's count.
diff --git a/jupyter/Math_and_Python_for_Data_analysis/week_2/week_2_task_1.ipynb b/jupyter/Math_and_Python_for_Data_analysis/week_2/week_2_task_1.ipynb
@@ -0,0 +1,148 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "2087a1a9",
+   "metadata": {},
+   "source": [
+    "## Задача 1: Сравнение предложений\n",
+    "Дан набор предложений, скопированных с Википедии. Каждое из них имеет \"кошачью тему\" в одном из трех смыслов:\n",
+    "* кошки (животные)\n",
+    "* UNIX-утилита cat для вывода содержимого файлов\n",
+    "* версии операционной системы OS X, названные в честь семейства кошачьих  \n",
+    "\n",
+    "Задача - найти два предложения, которые ближе всего по смыслу к расположенному в самой первой строке. В качестве меры близости по смыслу использовать косинусное расстояние.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 191,
+   "id": "d60fac0d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "(22, 254)\n"
+     ]
+    }
+   ],
+   "source": [
+    "import scipy as sp\n",
+    "import numpy as np\n",
+    "import re\n",
+    "# Открываем файл и считываем все предложения переводя их в нижний регистр\n",
+    "f = open('sentences.txt', 'r')\n",
+    "t = f.read().lower()\n",
+    "# Заводим список состоящий из всех предложений\n",
+    "sentences = t.split(\"\\n\")\n",
+    "# n - количество предложений (в конце - 1, потому что в конце файла стоит переход строки)\n",
+    "n = len(t.split(\"\\n\")) - 1\n",
+    "# Делим весь текст t на слова и убираем пустые, результат записываем в переменную b\n",
+    "a = re.split('[^a-z]', t)\n",
+    "b = []\n",
+    "for word in a:\n",
+    "    if word == '':\n",
+    "        continue\n",
+    "    b.append(word)\n",
+    "# Составляем словарь из всех различных слов, индексом каждого слова является число от 0 до (d - 1), где d - число различных слов в тексте\n",
+    "dic = {}\n",
+    "d = 0\n",
+    "for word in b:\n",
+    "    if dic.get(word) == None:\n",
+    "        dic[word] = d\n",
+    "        d += 1\n",
+    "# Создаем матрицу размером n на d\n",
+    "matrix = np.zeros((n, d))\n",
+    "print(matrix.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 192,
+   "id": "1659d59d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Функция считает количество слов word в предложении sen\n",
+    "def cnt_of_word_in_sentence(sen, word):\n",
+    "    words_in = re.split('[^a-z]', sen)\n",
+    "    ans = 0\n",
+    "    for i in words_in:\n",
+    "        if i == word:\n",
+    "            ans += 1\n",
+    "    return ans\n",
+    "# Заполняем матрицу: элемент с индексом (i, j) равен количеству вхождений j-го слова в i-е предложение\n",
+    "for i in range(n):\n",
+    "    for k, j in dic.items():\n",
+    "        matrix[i, j] = cnt_of_word_in_sentence(sentences[i], k)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 193,
+   "id": "ca122f11",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1 : 0.9527544408738466\n",
+      "2 : 0.8644738145642124\n",
+      "3 : 0.8951715163278082\n",
+      "4 : 0.7770887149698589\n",
+      "5 : 0.9402385695332803\n",
+      "6 : 0.7327387580875756\n",
+      "7 : 0.9258750683338899\n",
+      "8 : 0.8842724875284311\n",
+      "9 : 0.9055088817476932\n",
+      "10 : 0.8328165362273942\n",
+      "11 : 0.8804771390665607\n",
+      "12 : 0.8396432548525454\n",
+      "13 : 0.8703592552895671\n",
+      "14 : 0.8740118423302576\n",
+      "15 : 0.9442721787424647\n",
+      "16 : 0.8406361854220809\n",
+      "17 : 0.956644501523794\n",
+      "18 : 0.9442721787424647\n",
+      "19 : 0.8885443574849294\n",
+      "20 : 0.8427572744917122\n",
+      "21 : 0.8250364469440588\n",
+      "Answer is 4 and 6\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Находим косинусные расстояния от нулевой строки до всех остальных\n",
+    "from scipy.spatial import distance\n",
+    "for i in range(1, n):\n",
+    "    print(i,\":\", sp.spatial.distance.cosine(matrix[0],matrix[i]))\n",
+    "# Индексы строк ближайших к нулевой по косинусному расстоянию: 4 и 6\n",
+    "print(\"Answer is 4 and 6\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}