Python: Extracting Images from a Markdown File with a Single Line of Code

2024-01-09 03:53:06

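At first glance, this task looks simple. After all, markdown is a simple markup language with a well-defined syntax for images. But the devil is in the details.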

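When we try to extract images from a markdown file with a regular expression, we hit an obstacle: in grep's default basic regular expressions (BRE), an unescaped "+" is an ordinary character, not the "one or more" quantifier.

$ grep -o '!\[.+\](.+)' file.md


This command fails to match image references, including ones whose file names contain several words, such as "image with spaces.png": the pattern only matches text that contains literal plus signs.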
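
To see why the match fails, the BRE reading of the pattern can be mimicked in Python, whose `re` module always treats an unescaped `+` as a quantifier. This is a minimal sketch rather than grep itself: the sample line and the names `bre_like` and `ere_like` are assumptions for illustration, and escaping `+` stands in for BRE's literal interpretation.

```python
import re

# Hypothetical sample line; not taken from any real file.
line = "![photo](image with spaces.png)"

# Roughly how a BRE engine reads the pattern: "+" is a literal plus sign
# and the parentheses are literal characters.
bre_like = re.compile(r"!\[.\+\]\(.\+\)")

# Roughly how an ERE engine reads the corrected pattern:
# "+" means "one or more" and the inner parentheses form a group.
ere_like = re.compile(r"!\[.+\]\((.+)\)")

print(bre_like.search(line))           # None: the line has no literal "+"
print(ere_like.search(line).group(1))  # image with spaces.png
```

The first pattern would only match something like `![a+](b+)`, which is why the basic-syntax command finds nothing in an ordinary markdown file.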

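The fix is to use extended regular expressions (ERE), through grep -E or the older egrep command. In ERE, "+" is a quantifier, and parentheses that should match literally must be escaped:

$ grep -E -o '!\[.+\]\((.+)\)' file.md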


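Image references now match. One problem remains: the pattern accepts any target, including ones with no file extension. (grep -o prints each match in full; only the target part of each match is shown here.)

$ grep -E -o '!\[.+\]\((.+)\)' file.md
image with spaces
image.png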


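To solve this, we can require a literal dot inside the group, so that only targets containing an extension match (only the target part of each match is shown):

$ grep -E -o '!\[.*\]\((.*\..*)\)' file.md
image with spaces.png
image.png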
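
The effect of the extra dot can be checked in Python with the same two expressions. This is a minimal sketch under stated assumptions: the three-line sample text and the names `any_target` and `with_ext` are made up for illustration and do not come from the original file.md.

```python
import re

# Hypothetical markdown snippet with one image reference per line.
sample = "\n".join([
    "![first](image with spaces.png)",
    "![second](image.png)",
    "![third](no_extension)",
])

# Accepts every target, with or without an extension.
any_target = re.compile(r"!\[.*\]\((.*)\)")

# Requires a literal dot somewhere inside the target.
with_ext = re.compile(r"!\[.*\]\((.*\..*)\)")

print(any_target.findall(sample))
# ['image with spaces.png', 'image.png', 'no_extension']
print(with_ext.findall(sample))
# ['image with spaces.png', 'image.png']
```

Because each pattern contains exactly one group, `re.findall` returns only the captured target rather than the whole `![...](...)` match.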


Now we can reliably extract images from markdown files. Wrapped in a Python function, the same approach looks like this:

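```python
import re

def extract_images(md_file):
  """Extract image paths or URLs from a markdown file.

  Args:
    md_file: The path to the markdown file.

  Returns:
    A list of image targets that contain a dot (for example, a file extension).
  """

  with open(md_file, "r", encoding="utf-8") as f:
    md_text = f.read()

  # Negated character classes keep each match inside one ![...](...) reference,
  # so two images on the same line are not merged into a single match.
  image_urls = re.findall(r"!\[[^\]]*\]\(([^)]*\.[^)]*)\)", md_text)

  return image_urls
```

We can also use two groups to capture each image's alt text together with its target:

```python
import re

def extract_image_names(md_file):
  """Extract image alt texts and targets from a markdown file.

  Args:
    md_file: The path to the markdown file.

  Returns:
    A list of (alt_text, target) tuples, one per image.
  """

  with open(md_file, "r", encoding="utf-8") as f:
    md_text = f.read()

  # With two groups, re.findall returns one tuple per match.
  image_names = re.findall(r"!\[([^\]]*)\]\(([^)]*)\)", md_text)

  return image_names
```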
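
One caveat with patterns built on `.*`: they are greedy, so two images on the same line collapse into a single match. The sketch below shows the problem and one possible fix using negated character classes. The sample text and the names `greedy` and `bounded` are assumptions for illustration, and image titles such as `![a](b.png "title")` are deliberately not handled.

```python
import re

# Hypothetical line with two image references side by side.
text = "![a](one.png) and ![b](two.png)"

# Greedy version: ".*" runs to the last "](" on the line.
greedy = re.compile(r"!\[.*\]\((.*\..*)\)")

# Bounded version: alt text stops at "]" and the target stops at ")".
bounded = re.compile(r"!\[[^\]]*\]\(([^)]*\.[^)]*)\)")

print(greedy.findall(text))   # ['two.png']
print(bounded.findall(text))  # ['one.png', 'two.png']
```

With the bounded pattern, extraction from a file still fits in one expression, for example `bounded.findall(pathlib.Path("file.md").read_text(encoding="utf-8"))`, assuming `pathlib` is imported and file.md is UTF-8 encoded.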