2016/05/05

[Apache Tika] How to Check a File's Media Type

Problem
We have a file upload function and need to verify that the uploaded file's media type is what we expect.
For example, if the upload function accepts only xls or xlsx files, every other media type must be rejected.
How can we do this?

How-to
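Before tackling this problem, add the tika-core dependency to your pom.xml. The sample code below also uses FileUtils from Apache Commons IO, so commons-io must be on the classpath as well.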
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.9</version>
            <scope>compile</scope>
        </dependency>

Assume we want to check the following media types:
package albert.practice.file;

//media type list: http://www.iana.org/assignments/media-types/media-types.xhtml
public class MediaTypes {

 public static final String DOC = "application/msword";
 public static final String DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
 public static final String XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
 public static final String TXT = "text/plain";
 public static final String JPG = "image/jpeg";
 public static final String PDF = "application/pdf";
 public static final String EXE = "application/x-msdownload";

}

The following class uses Tika to detect each file's media type and compares it with the expected one:
package albert.practice.file;

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.tika.Tika;

/**
 * 
 * @author albert
 * 
 */
public class FileMediaTypeTest {

 // set up the path and name of test file
 private final String DROPBOX_HOME = "/Users/albert/Dropbox/";
 private final String DOC_FILE = DROPBOX_HOME + "庫務組/交付文件(第二階段)/PDS/NTA_PSR_FMS_PDS_1030930_V1.0.doc";
 private final String DOCX_FILE = DROPBOX_HOME + "債務管理系統/NTA_IFMIS_DBM_測試及操作文件/DBM001E.docx";
 private final String XLSX_FILE = DROPBOX_HOME
   + "庫務組/交付文件(第二階段)/PDS/NTA_PSR_FMS_PDS_附件/NTA_PSR_FMS_PDS_1030901_需求追溯表.xlsx";
 private final String TXT_FILE = DROPBOX_HOME + "庫務組/測試.txt";
 private final String JPG_FILE = DROPBOX_HOME + "庫務組/ads100fa.jpg";
 private final String PDF_FILE = DROPBOX_HOME + "eBooks/Head First Python.pdf";
 private final String FAKE_EXE_FILE = DROPBOX_HOME + "eBooks/The Intelligent Investor 拷貝.exe";

 public static void main(String[] args) throws IOException {
  new FileMediaTypeTest().testFileMediaType();
 }

 public void testFileMediaType() throws IOException {
  checkFileMediaType(DOC_FILE, MediaTypes.DOC);
  checkFileMediaType(DOCX_FILE, MediaTypes.DOCX);
  checkFileMediaType(XLSX_FILE, MediaTypes.XLSX);
  checkFileMediaType(TXT_FILE, MediaTypes.TXT);
  checkFileMediaType(JPG_FILE, MediaTypes.JPG);
  checkFileMediaType(PDF_FILE, MediaTypes.PDF);
  checkFileMediaType(FAKE_EXE_FILE, MediaTypes.EXE);
 }

 public void checkFileMediaType(String sourceFile, String expectedMediaType) throws IOException {

  File file = FileUtils.getFile(sourceFile);
  try {
   Tika tika = new Tika();

   // Detects the media type of the given file. The type detection is
   // based on the document content and a potential known file
   // extension.
   String mediaType = tika.detect(file);

   System.out.println("\nchecking " + sourceFile + "...");

   if (!(expectedMediaType.equals(mediaType))) {
    String actualMediaTypeName = mediaType;
    String errorMsg = "Wrong media type ! Expected:" + expectedMediaType + ", Actual:"
      + actualMediaTypeName;
    System.err.println(errorMsg);
    throw new RuntimeException(errorMsg);
   } else {
    System.out.println("Correct media type : " + mediaType);
   }

  } catch (IOException e) {
   e.printStackTrace();
  }
 }

}
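
The check above compares each file against a single expected type. For the upload case from the Problem section, where either xls or xlsx is acceptable, one option is an allow-list of media types. The following is only a minimal sketch: the class name SpreadsheetUploadPolicy and the method isAllowed are hypothetical and not part of Tika, and it assumes the detector reports the IANA-registered type application/vnd.ms-excel for xls files.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of Tika): allow-list check for an upload's detected media type. */
final class SpreadsheetUploadPolicy {

    // IANA-registered media types for legacy .xls and OOXML .xlsx workbooks.
    static final String XLS = "application/vnd.ms-excel";
    static final String XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    private static final Set<String> ALLOWED =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList(XLS, XLSX)));

    private SpreadsheetUploadPolicy() {
    }

    /**
     * @param detectedMediaType the value returned by the detector, e.g. tika.detect(file)
     * @return true only when the detected type is one of the allowed spreadsheet types
     */
    static boolean isAllowed(String detectedMediaType) {
        return detectedMediaType != null && ALLOWED.contains(detectedMediaType);
    }

    public static void main(String[] args) {
        // In an upload handler the argument would come from tika.detect(uploadedFile).
        System.out.println(isAllowed(XLSX));        // true
        System.out.println(isAllowed("image/png")); // false
    }
}
```

In an upload handler, the result of tika.detect(file) would be passed to isAllowed, and the upload rejected when it returns false.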

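Running FileMediaTypeTest prints the following to the console. The last file has an .exe extension, but Tika detects its content as image/png, so the check fails with an exception: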
checking /Users/albert/Dropbox/庫務組/交付文件(第二階段)/PDS/NTA_PSR_FMS_PDS_1030930_V1.0.doc...
Correct media type : application/msword

checking /Users/albert/Dropbox/債務管理系統/NTA_IFMIS_DBM_測試及操作文件/DBM001E.docx...
Correct media type : application/vnd.openxmlformats-officedocument.wordprocessingml.document

checking /Users/albert/Dropbox/庫務組/交付文件(第二階段)/PDS/NTA_PSR_FMS_PDS_附件/NTA_PSR_FMS_PDS_1030901_需求追溯表.xlsx...
Correct media type : application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

checking /Users/albert/Dropbox/庫務組/測試.txt...
Correct media type : text/plain

checking /Users/albert/Dropbox/庫務組/ads100fa.jpg...
Correct media type : image/jpeg

checking /Users/albert/Dropbox/eBooks/Head First Python.pdf...
Correct media type : application/pdf

checking /Users/albert/Dropbox/eBooks/The Intelligent Investor 拷貝.exe...
Wrong media type ! Expected:application/x-msdownload, Actual:image/png
Exception in thread "main" java.lang.RuntimeException: Wrong media type ! Expected:application/x-msdownload, Actual:image/png
 at albert.practice.file.FileMediaTypeTest.checkFileMediaType(FileMediaTypeTest.java:59)
 at albert.practice.file.FileMediaTypeTest.testFileMediaType(FileMediaTypeTest.java:38)
 at albert.practice.file.FileMediaTypeTest.main(FileMediaTypeTest.java:28)
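
The last result shows why checking the extension alone is not enough: the file is named .exe, yet its content is reported as image/png. Content-based detection generally works by looking at the leading "magic" bytes of a file. The sketch below illustrates the idea for PNG only, using the fixed 8-byte PNG signature. It is a simplified illustration, not Tika's implementation, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Simplified illustration of content-based detection; this is not Tika's implementation. */
final class PngSignatureSketch {

    // The fixed 8-byte signature that starts every PNG file.
    private static final byte[] PNG_SIGNATURE = {
        (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
    };

    private PngSignatureSketch() {
    }

    /** Returns true when the given leading bytes begin with the PNG signature. */
    static boolean looksLikePng(byte[] header) {
        if (header == null || header.length < PNG_SIGNATURE.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, PNG_SIGNATURE.length), PNG_SIGNATURE);
    }

    public static void main(String[] args) {
        // First bytes of a PNG file, followed by arbitrary data.
        byte[] pngHeader = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n', 0, 0};
        // Windows executables start with the two bytes "MZ" instead.
        byte[] exeHeader = {'M', 'Z', 0, 0, 0, 0, 0, 0};
        System.out.println(looksLikePng(pngHeader)); // true
        System.out.println(looksLikePng(exeHeader)); // false
    }
}
```

Renaming a file changes only its name, not these leading bytes, which is why a detector that inspects the content is not fooled by the extension.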


Reference
[1] https://tika.apache.org/
[2] http://www.iana.org/assignments/media-types/media-types.xhtml
